Context-Dependent Conflation, Text Filtering and Clustering
Abstract
The presence of trivial words in text databases can adversely affect the clustering of records or concepts (words/phrases). Moreover, whether a word or phrase is trivial depends on context. The objective of the present paper is to demonstrate a context-dependent trivial word filter that improves clustering quality. Factor analysis was used as a context-dependent trivial word filter for subsequent term clustering. Medline records for Raynaud’s Phenomenon were used as the database, and words were extracted from the record Abstracts. A factor matrix of these words was generated, and the words that had low factor loadings across all factors were identified and eliminated. The remaining words, which had high factor loading values for at least one factor and were therefore influential in determining the theme of that factor, were input to the clustering algorithm. Both quantitative and qualitative analyses were used to show that factor matrix filtering leads to higher quality clusters and subsequent taxonomies.
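The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the factor loadings have already been computed by some factor analysis routine; the function name `filter_trivial_words`, the 0.3 cutoff, and the toy loading values are all hypothetical and are not taken from the paper, which does not state a threshold here.

```python
from typing import Dict, List


def filter_trivial_words(
    loadings: Dict[str, List[float]], threshold: float = 0.3
) -> List[str]:
    """Keep words whose absolute loading reaches `threshold` on at least one factor.

    `loadings` maps each word to its row of the factor matrix.
    The default threshold is an illustrative assumption, not a value from the paper.
    """
    return [
        word
        for word, row in loadings.items()
        # A word with low loadings on every factor is treated as trivial.
        if max((abs(value) for value in row), default=0.0) >= threshold
    ]


# Toy factor matrix, invented for illustration only: rows are words, columns are factors.
toy_loadings = {
    "vasospasm": [0.72, 0.05],
    "scleroderma": [0.10, -0.64],
    "study": [0.08, 0.11],
    "results": [-0.04, 0.09],
}

kept = filter_trivial_words(toy_loadings)
print(kept)  # words that would be passed on to the clustering algorithm
```

In this sketch the absolute value is used so that strongly negative loadings also count as influential; whether the paper treats sign this way is an assumption.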
Similar resources
A Survey of Text Clustering Algorithms
Clustering is a widely studied data mining problem in the text domains. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the...
Mining Text Data
Clustering is a widely studied data mining problem in the text domains. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the...
Acronyms as an Integral Part of Multi-Word Term Recognition - A Token of Appreciation
Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi-word term...
Improving Web Service Clustering through Post Filtering to Bootstrap the Service Discovery
Web service clustering is a very efficient approach to discovering Web services. Current approaches use similarity-distance measurement methods such as string-based, corpus-based, knowledge-based and hybrid methods. These approaches have problems that include discovering semantic characteristics, loss of semantic information, shortage of high-quality ontologies and encoding fine...
An Efficient Technique to Improve Snippet Clustering
Document clustering is an effective tool to manage information overload. By grouping similar documents together, we enable a human observer to quickly browse large document collections and to easily grasp the distinct topics and subtopics. In this paper we survey the most important problems and techniques related to text information retrieval: document pre-processing and filtering...